1. Previous Important AI Algorithms
- N-gram Models - 1950s: Predict the next word based on the previous ones (e.g., the previous word in a bigram model).
- Bag of Words - 1950s: A simple model that represents text as an unordered collection of word counts, without considering grammar or word order.
- Cosine Similarity - 1958: Measures how similar two vectors are by the angle between them (see the sketch after this list).
Formula:

$$\text{similarity}(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

Where:
- $A$ is the first vector.
- $B$ is the second vector.
- $A \cdot B$ is the dot product of vectors A and B.
- $\|A\|$ is the magnitude (norm) of vector A.
- $\|B\|$ is the magnitude (norm) of vector B.
- TF-IDF (Term Frequency-Inverse Document Frequency) - 1970s: Weighs words based on their frequency in a document and their rarity in the corpus, helping to identify important terms.
- Word Embeddings - 2013: Convert words into vectors that capture their meaning and relationships.
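As a small illustration of the ideas above, here is a minimal sketch (assuming NumPy and scikit-learn are installed; the two toy documents are made up) that builds TF-IDF vectors and compares them with the cosine similarity formula:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (a . b) / (||a|| * ||b||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# TF-IDF turns each document into a weighted bag-of-words vector.
tfidf = TfidfVectorizer().fit_transform(docs).toarray()

# Compare the two documents by the angle between their TF-IDF vectors.
print(cosine_similarity(tfidf[0], tfidf[1]))
```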
2. Recurrent Neural Network RNN
It is designed to process sequential data. RNNs maintain a hidden state that is updated at each time step, allowing the model to remember information from previous inputs. The basic equation is:
$$h_t = \text{activation}(W_{xh} x_t + W_{hh} h_{t-1} + b)$$

Where:
- $h_t$: The hidden state at time $t$.
- $x_t$: The input at time $t$.
- $W_{xh}$: The weight matrix connecting the input ($x_t$) to the hidden state ($h_t$).
- $W_{hh}$: The weight matrix connecting the previous hidden state ($h_{t-1}$) to the current hidden state ($h_t$).
- $b$: The bias term added to the weighted sum.
- activation: An activation function, such as tanh, sigmoid, or ReLU, applied to the sum.
RNNs form the foundation for many modern AI applications by enabling the processing of sequences with context. But they have limitations in remembering long-term information due to the vanishing gradient.
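As a minimal sketch of this update rule (NumPy only; toy dimensions and random weights, purely illustrative):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    """One RNN time step: h_t = tanh(W_xh @ x_t + W_hh @ h_prev + b)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

# Toy sizes: 4-dimensional inputs, 3-dimensional hidden state.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
W_xh = rng.normal(size=(hidden_dim, input_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                     # initial hidden state
for x_t in rng.normal(size=(5, input_dim)):  # a sequence of 5 random "inputs"
    h = rnn_step(x_t, h, W_xh, W_hh, b)      # the hidden state carries context forward
print(h)
```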
3. Long Short-Term Memory (LSTM)
Designed to improve on the memory of an RNN, it decides what to remember using 3 gates (forget, input, and output), maintains a cell state for long-term memory, and updates its state using a candidate cell state and a hidden state. These improvements make the LSTM slower to train and run. In compact form, all gates and the candidate can be computed with a single combined weight matrix:
$$\begin{bmatrix} f_t \\ i_t \\ o_t \\ \tilde{c}_t \end{bmatrix} =
\begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix}
\left( W \begin{bmatrix} h_{t-1} \\ x_t \end{bmatrix} + b \right), \qquad
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad
h_t = o_t \odot \tanh(c_t)$$

Where:
- $x_t$ is the input at time step $t$
- $h_{t-1}$ is the previous hidden state
- $W$ is the combined weight matrix for all gates and the candidate
- $b$ is the bias vector
- $\sigma$ is the sigmoid activation function
- $\tanh$ is the hyperbolic tangent activation
- $f_t$ is the forget gate
- $i_t$ is the input gate
- $o_t$ is the output gate
- $\tilde{c}_t$ is the candidate cell state
- $c_t$ is the current cell state
- $c_{t-1}$ is the previous cell state
- $h_t$ is the current hidden state
- $\odot$ represents element-wise (Hadamard) multiplication
Forget Gate
The forget gate decides which information from the previous cell state should be discarded. It is calculated as:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$

Where:
- $f_t$: The forget gate output at time $t$.
- $W_f$: Weight matrix for the forget gate.
- $h_{t-1}$: The hidden state at the previous time step.
- $x_t$: The input at time $t$.
- $b_f$: The bias term for the forget gate.
- $\sigma$: The sigmoid activation function.
Input Gate
The input gate decides which values from the current input and the previous hidden state will update the cell state. It is calculated as:

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$

Where:
- $i_t$: The input gate output at time $t$.
- $W_i$: Weight matrix for the input gate.
- $h_{t-1}$: The previous hidden state.
- $x_t$: The current input.
- $b_i$: The bias term for the input gate.
- $\sigma$: The sigmoid activation function.
Output Gate
The output gate decides what the next hidden state will be, based on the current cell state. It is calculated as:

$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$

Where:
- $o_t$: The output gate output at time $t$.
- $W_o$: Weight matrix for the output gate.
- $h_{t-1}$: The previous hidden state.
- $x_t$: The current input.
- $b_o$: The bias term for the output gate.
- $\sigma$: The sigmoid activation function.
Candidate Cell State
The candidate cell state proposes new values that could be written into the cell state. It is calculated as:

$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$$

Where:
- $\tilde{c}_t$ is the candidate cell state at time step $t$.
- $W_c$ is the weight matrix for the candidate cell state.
- $h_{t-1}$ is the previous hidden state.
- $x_t$ is the current input.
- $b_c$ is the bias term for the candidate cell state.
Cell State Update
The cell state keeps part of the old memory (scaled by the forget gate) and adds the new candidate values (scaled by the input gate):

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

Where:
- $c_t$ is the current cell state at time step $t$.
- $f_t$ is the forget gate's output at time step $t$.
- $c_{t-1}$ is the previous cell state.
- $i_t$ is the input gate's output at time step $t$.
- $\tilde{c}_t$ is the candidate cell state.
Hidden State Update
The hidden state is a filtered view of the cell state, controlled by the output gate:

$$h_t = o_t \odot \tanh(c_t)$$

Where:
- $h_t$ is the hidden state at time step $t$.
- $o_t$ is the output gate's output at time step $t$.
- $c_t$ is the current cell state at time step $t$.
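To see how the pieces fit together, here is a minimal NumPy sketch of one LSTM step using the combined weight matrix from the compact form above (toy dimensions and random weights; illustrative only, not an optimized implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b, hidden_dim):
    """One LSTM step: combined weight matrix W covers all three gates and the candidate."""
    z = W @ np.concatenate([h_prev, x_t]) + b           # all pre-activations at once
    f = sigmoid(z[0:hidden_dim])                        # forget gate
    i = sigmoid(z[hidden_dim:2 * hidden_dim])           # input gate
    o = sigmoid(z[2 * hidden_dim:3 * hidden_dim])       # output gate
    c_tilde = np.tanh(z[3 * hidden_dim:4 * hidden_dim]) # candidate cell state
    c = f * c_prev + i * c_tilde                        # cell state update
    h = o * np.tanh(c)                                  # hidden state update
    return h, c

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
W = rng.normal(size=(4 * hidden_dim, hidden_dim + input_dim))
b = np.zeros(4 * hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):   # a toy sequence of 5 inputs
    h, c = lstm_step(x_t, h, c, W, b, hidden_dim)
print(h, c)
```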
4. Gated Recurrent Unit (GRU)
Designed to simplify the LSTM by using only an update gate and a reset gate, making it more efficient. It merges cell and hidden states into one, reducing complexity while keeping strong performance.
$$z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)$$
$$r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)$$
$$\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h)$$
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$$

Where:
- $x_t$ is the input at time $t$.
- $h_{t-1}$ is the previous hidden state.
- $z_t$ is the update gate (controls how much of the past to keep).
- $r_t$ is the reset gate (controls how much past info to forget).
- $\tilde{h}_t$ is the candidate hidden state.
- $h_t$ is the new hidden state.
- $\odot$ is element-wise multiplication.
- $W_z, W_r, W_h$ and $b_z, b_r, b_h$ are the weights and biases.
- $\sigma$ is the sigmoid function.
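A similar minimal NumPy sketch of one GRU step, following the equations above (toy dimensions, random weights; the blending of old and candidate states matches the "keep the past" convention used here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU step: update gate z, reset gate r, candidate h_tilde, new hidden state."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ hx + b_z)                 # how much of the past to keep
    r = sigmoid(W_r @ hx + b_r)                 # how much of the past to forget
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]) + b_h)  # candidate state
    return z * h_prev + (1.0 - z) * h_tilde     # blend old state and candidate

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
shape = (hidden_dim, hidden_dim + input_dim)
W_z, W_r, W_h = (rng.normal(size=shape) for _ in range(3))
b_z = b_r = b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):     # a toy sequence of 5 inputs
    h = gru_step(x_t, h, W_z, W_r, W_h, b_z, b_r, b_h)
print(h)
```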
Then, there were other improvements, like processing the sequence from right to left, or in both directions (bidirectional RNNs). Finally, a scoring mechanism (attention) was introduced to compare the current element with the rest of the sequence and understand the importance of each word. This improved the handling of long-range dependencies and addressed the memory problem, and it is the basis of transformers.
5. Transformers
- Input: a sequence of words (tokens)
- Embedding: transform the words into numerical representations, using methods like Word2Vec or learned embeddings
- Positional Encoding: adds additional information to help the model understand the order of the words
- Attention Mechanism: calculates the importance of each word with respect to the others. The calculation is done in parallel: for every word, its attention to all other words is computed. So, if the same word appears in two contexts with different meanings, it receives different attention values, which is why transformers can capture context and meaning (see the sketch after this list)
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where:
- $Q$ is the query (the current word being processed)
- $K$ is the key (represents all words)
- $V$ is the value (the information of all words)
- $d_k$ is the dimension of the key vectors
- Feedforward: a small layer that adjusts each vector (refines the representation), applied to each word independently
- Normalization: the result is normalized to avoid issues like vanishing gradients
- Decoder: after several iterations, the attention results are transformed back into words
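Here is a minimal NumPy sketch of the scaled dot-product attention formula above, applied to a toy sequence of 5 token vectors (the projection matrices W_q, W_k, W_v are random stand-ins for learned weights):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V: each query row attends over all key/value rows."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                                        # weighted sum of the values

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                     # 5 tokens, 8-dimensional vectors (toy sizes)
X = rng.normal(size=(seq_len, d_model))     # token embeddings + positional encoding
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)  # (5, 8): one context-aware vector per token
```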
How is this used?
After training, the model holds the learned weight matrices used to compute attention (the attention scores themselves are recomputed for every input). You input a sentence, and based on these learned weights, the model predicts the next word step by step.
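As a rough illustration of that step-by-step prediction loop, here is a hypothetical sketch; `predict_next_token` is a placeholder for a trained transformer, not a real API:

```python
# Hypothetical sketch of autoregressive generation: `predict_next_token` stands in
# for a trained transformer that returns the most likely next token.
def generate(prompt_tokens, predict_next_token, max_new_tokens=20, eos="<eos>"):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next_token(tokens)  # uses the learned weights
        if next_token == eos:
            break
        tokens.append(next_token)                # feed the growing sequence back in
    return tokens
```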